
Efficient sparse retrieval through embedding-based inverted index construction 

Annotation

Modern search engines use a two-stage architecture for efficient and high-quality search over large volumes of data. In the first stage, simple and fast algorithms such as BM25 are applied, while in the second stage more precise but resource-intensive methods, such as deep neural networks, are employed. Although this approach yields good results, its quality is fundamentally limited by the vocabulary mismatch problem inherent in the simple first-stage algorithms. To address this issue, we propose an algorithm for constructing an inverted index using vector representations, combining the advantages of both stages: the efficiency of the inverted index and the high search quality of vector models. We suggest creating a vector index that preserves the various semantic meanings of vocabulary tokens. For each token, we identify the documents in which it occurs and then cluster its contextualized embeddings. The centroids of the resulting clusters represent the different semantic meanings of the token. This process forms an extended vocabulary, which is used to build the inverted index. During index construction, similarity scores between each semantic meaning of a token and the documents are calculated; these scores are then used during search, reducing the number of similarity computations required at query time. Searching the inverted index first requires finding keys in the vector index, which helps to solve the vocabulary mismatch problem. The operation of the algorithm is demonstrated on a search task over the SciFact dataset. It is shown that the proposed method achieves high search quality with low memory requirements: the vector index remains compact, and its size is constant, depending only on the size of the vocabulary. The main drawback of the algorithm is the need to use a deep neural network to generate vector representations of queries during search, which slows down this stage. Finding ways to address this issue and accelerate the search process is a direction for future research.
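The following is a minimal sketch, not the authors' implementation, of the indexing idea outlined above: for each vocabulary token, the contextualized embeddings collected from the documents containing it are clustered, the cluster centroids serve as that token's distinct semantic meanings, and similarity scores between each centroid and the documents are precomputed for the inverted index. The input formats (`token_embeddings`, `doc_embeddings`), the fixed number of senses per token, and the dot-product scoring are illustrative assumptions.

```python
# Sketch of building the extended vocabulary and the inverted index
# described in the abstract (hypothetical data layout, not the paper's code).
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

def build_extended_vocabulary(token_embeddings, doc_embeddings, n_senses=4):
    """token_embeddings: {token: [(doc_id, contextual_embedding), ...]}  (assumed format)
    doc_embeddings:   {doc_id: document embedding vector}              (assumed format)
    Returns per-token sense centroids and an inverted index with precomputed scores."""
    centroids = {}                       # token -> array of sense centroids
    inverted_index = defaultdict(list)   # (token, sense_id) -> [(doc_id, score), ...]

    for token, occurrences in token_embeddings.items():
        doc_ids = [d for d, _ in occurrences]
        vectors = np.stack([v for _, v in occurrences])

        # Cluster the token's contextualized embeddings; each centroid
        # represents one semantic meaning of the token.
        k = min(n_senses, len(vectors))
        kmeans = KMeans(n_clusters=k, n_init=10).fit(vectors)
        centroids[token] = kmeans.cluster_centers_

        # Precompute similarity between every sense and the documents that
        # contain the token, so document-side scores need no model inference
        # at query time.
        for sense_id, centroid in enumerate(kmeans.cluster_centers_):
            for doc_id in set(doc_ids):
                score = float(np.dot(centroid, doc_embeddings[doc_id]))
                inverted_index[(token, sense_id)].append((doc_id, score))

    return centroids, inverted_index
```

At query time, under the same assumptions, the query would be encoded by a deep neural model, its embeddings matched against the centroid keys of the vector index, and the precomputed posting lists of the matched senses merged to rank documents.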
